ABSTRACT


One important obstacle to Thai COVID-19 recovery is fake news shared on social media, which
is one of the “Artificial Intelligence Open Issues against COVID-19” reported by Montreal.AI.
The spread of misinformation is one of the main cyber-security threats, and it should be
filtered out by a detector that acts as the IDS for maintaining COVID-19 information quality.
Detecting fake news in Thai texts requires Thai-NLP techniques. This paper proposes a
state-of-the-art Thai COVID-19 fake news detection that learns word relations using transfer
learning models. For pre-training, the source dataset is constructed by translating the global
open COVID-19 datasets from English to Thai. A novel feature shifting is formulated to enlarge
the number of Thai text examples in the target dataset. Machine translation can thus be used
to construct a Thai source dataset, which copes with the lack of local datasets for future
Thai-NLP applications. To advance Thai text understanding, feature shifting is a promising
way to improve accuracy in the fine-tuning stage.
NOMENCLATURE

Ada-SGD Adaptive stochastic gradient descent

arXiv.CS arXiv computer science

AWD-LSTM Averaged stochastic gradient descent weight-dropped long short-term memory

BERT Bidirectional encoder representations from transformers

BEST Benchmark for enhancing the standard of Thai language processing

bi-LSTM Bidirectional long short-term memory

BLEU Bilingual evaluation understudy

CoAID COVID-19 healthcare misinformation dataset

COVID-19 Coronavirus disease 2019

English-NLP English natural language processing

GPT Generative pre-training

GRU Gated recurrent unit

IDS Intrusion detection system

iSAI-NLP International Joint Symposium On Artificial Intelligence And Natural Language Processing

MLP Multi-layer perceptron

Montreal.AI Montreal artificial intelligence

NECTEC Thailand's National Electronics And Computer Technology Center

NLP Natural language processing

OOV Out of vocabulary

PRICAI Pacific Rim International Conference On Artificial Intelligence

PyThaiNLP Python packages for Thai natural language processing

ReCOVery Multimodal repository for COVID-19 news credibility research

RNN Recurrent neural network

SCB-MT-EN-TH Siam commercial bank’s machine translation in English and Thai 

State of AI State of artificial intelligence report 

SVM Support vector machine 

TCI Thai citation index 

Thai-NLP Thai natural language processing 

ULMFiT Universal language model fine-tuning for text classification 

VISTEC.AI Vidyasirimedhi Institute of Science and Technology Artificial Intelligence 


1. INTRODUCTION 

More than one hundred years ago, both cholera and the Spanish flu spread widely in Thailand [1],
much as COVID-19 did in 2019-2020. Siriraj Hospital, known as the oldest hospital in the Mekong
region (Myanmar, Laos, Vietnam, Cambodia, and Thailand), served as the main medical hub [2, 3]
for curing many infected people during the reigns of King Chulalongkorn and King
Phramongkutklao. During 2019-2020, many regions of the world faced [3, 4] the spread of a new
pandemic, known as COVID-19. As the main spreading zones in Thailand were detected (e.g.,
Lumpinee Boxing Stadium [5]), many supporting policies were proposed by Thai government
sectors (e.g., Thai Pantry of Sharing [6]). Building on King Chulalongkorn’s foundation [2] and
Prince Mahidol’s contribution [2] to the flourishing of Thai public health and medical
proficiency, Thailand ranked at the top of the world’s ongoing COVID-19 recovery index [7] by
Johns Hopkins Coronavirus Resource Center, and was ready to be one of the best health and
medical tourism destinations in the world after COVID-19 (known as the New Normal).

Thanks are due to all Thai health and medical personnel, who had been working hard day and
night since December 2019. Moreover, many world-class medical centers (at the same level as
hospitals in developed countries) that could treat COVID-19 were found in Thailand, e.g.,
Siriraj Hospital, Bamrasnaradura Infectious Diseases Institute, King Chulalongkorn Memorial
Hospital, Ramathibodi Hospital, Panyanunthaphikkhu Chonprathan Medical Center, Songklanagarind
Hospital, Phramongkutklao Hospital and Chulabhorn Hospital, which could handle almost all Thai
and other ASEAN patients. In the literature, there were more than 100 original COVID-19 papers
from Thai health personnel [8] that reported new medical knowledge. Turning to another
impactful problem, a cyber-security threat, much COVID-19 fake news and many spambots (or
other types of inaccurate information) on social media [9, 10], composed by fraudulent social
accounts, could be quickly shared or posted among millions of users of platforms such as
Facebook, Youtube, and Twitter. This easily led to serious health misinformation and harm, and
was known as one of the big problems against Thai COVID-19 recovery.

— Artificial intelligence open issues against COVID-19 

COVID-19 fake news detection was one of the “Artificial intelligence open issues against
COVID-19” reported by Montreal.AI [11], known as a world-leading group in AI research and
innovation. It was still an open issue, but it was possible to use NLP [12] to detect
misinformation [13] from the content. Many fake news detection tutorials and codes were also
available on PapersWithCode [14]. Fake news referred to misinformation, mostly in informal
posts [15], such as misinterpretation, personal bias, and attempts to influence people. It is
important to evaluate the quality of social information by NLP [16] in order to maintain
information security, especially during the COVID-19 period. During the 2 m social distancing
period, many state-of-the-art COVID-19 fake news detections were investigated [17]. Since real
and fake information have the same spreading pattern, public hashtags, e.g.,
"#SocialDistancing" and "#WorkFromHome" [18], together with user accounts on Twitter, were
crawled and analyzed to detect fake news. Multi-source information, e.g., contents, user
accounts, spread patterns or URLs, was investigated on Twitter [19, 20]. Besides Twitter,
Facebook pages [21-23] were also found to be one of the largest misinformation sources [13],
but collecting all of this information manually was found to be expensive compared with
detection by the textual content alone. A COVID-19 ontology was designed for medical knowledge
retrieval by keywords [24]. Some COVID-19 fake news datasets are available, e.g., CoAID
[25, 26], ReCOVery [27, 28] and FakeCovid [29, 30]. arXiv.CS [31, 32] was the largest
publication source for COVID-19 fake news detection. However, the prior papers investigated
English texts. As with Arabic and English [33], Thai and English have totally different
writing conventions (e.g., there is no space between any 2 Thai words [34-36]), so the
investigation in Thai texts should be categorized as a separate problem.

— The COVID-19 fake news detection as one of Thai-NLP applications 

In studies of misinformation detection on Thai social media, a large number of tweet texts
with other multi-source information [37], e.g., URLs, FriendsCount, FavouritesCount,
StatusesCount and RetweetCount, were collected, cleaned and classified by traditional machine
learning [38] such as SVM [39], MLP [40] and the Naive Bayes method [41]. However, this
approach could not directly detect real-time COVID-19 fake news using only Thai textual
content; it needed prior multi-source information, and collecting all attributes by brute
force was expensive for large-scale streaming information. Practically, COVID-19 fake news
detection using only Thai textual content was a Thai-NLP problem that needed Thai-specific
datasets and technical references (unlike English-NLP or other techniques). Thai-NLP and
computing have been researched for more than 30 years [42, 43]. Many state-of-the-art papers
on Thai-NLP had been published by the NECTEC [44-46] research group, known as an official
research and development organization in Thailand that founded the AI for Thai project [47]
and is also an important part of the Thai-NLP foundation (e.g., the annual BEST competition
[48-50] by NECTEC). Recently, one of the well-known local projects matched similar Thai
keywords between a list of experts and manuscript keywords in order to assign reviewers to
papers automatically [51]. It was a collaboration between NECTEC and TCI, known as Thailand’s
largest publication metric [52] and index [53, 54]. The concept behind this project was Thai
text classification [55, 56], which is also a type of Thai-NLP problem. A large number of
novel Thai-NLP papers can easily be found in 2 main conferences (known as the largest
Thai-NLP archives): iSAI-NLP [57] and PRICAI [59]. There were also many raw Thai textual
sources that reflect evolving language usage and can be used to build textual datasets, e.g.,
Pantip, Facebook and Youtube. Moreover, PyThaiNLP [58] tutorials with libraries and codes
were available for beginners.

Since written Thai is an unsegmented language, many state-of-the-art papers were proposed to
solve Thai word segmentation [34, 35] by bi-LSTM [60], adversarial examples [35], pre-training
models [61] or unsupervised methods with optimization [62]. Word segmentation was still the
main shortfall in the Thai-NLP community [63], and it affected the correctness of other
Thai-NLP tasks [46, 63], e.g., part-of-speech tagging, parsing, text classification,
information extraction, semantic role labeling, machine translation, sentiment analysis, event
extraction and question answering. An ontology with the semantic web [64] was developed by
NECTEC and was mostly used to formulate the knowledge structure [65, 66] of unsegmented words
and sentences for Thai text knowledge and retrieval [67]. Among other original Thai-NLP
applications, bullying in Thai social opinions could be detected by a modified GRU [68]. Thai
plagiarism detection was developed [49, 69, 70] and could be used by TCI soon [51].
State-of-the-art transfer learning with self-attention [71], such as BERT [72], ULMFiT [73]
and GPT [74] used as semantic word embeddings, was recently implemented in Thai opinion
analysis [75], which could be extended with Thai synonyms.

To extend those prior works, as shown in Figure 1, this paper proposes a state-of-the-art
Thai-NLP formulation based on word-level transfer learning models (the ongoing NLP advancement
[76] reported by State of AI) as a solution to the serious health misinformation problem
against Thai COVID-19 recovery. The main contributions of this paper are: (i) COVID-19 fake
news detection based on transfer learning is introduced as a Thai-NLP problem and serves as
the IDS for filtering COVID-19 information; (ii) the source dataset for the pre-training stage
can be constructed by English to Thai machine translation over the global open COVID-19
datasets; (iii) feature shifting is formulated to generate more Thai text examples for the
target dataset, using synonyms and different Thai written styles that have the same semantic
content; (iv) future Thai-NLP based fake news detection and other applications can exploit
feature shifting to enlarge the Thai text examples in the target dataset, and machine
translation from global datasets into Thai to build the source dataset. The rest of this paper
is organized as follows. Section 2 describes the research method. Measurement and results are
presented in section 3. Finally, section 4 is the conclusion.


[Figure 1: block diagram with the blocks Thai source dataset (1# and 2#), open datasets,
translate to Thai, Thai text cleansing, Thai word segmentation, feature shifting, pre-training
and Thai-NLP models.]

Figure 1. The proposed COVID-19 fake news detection based on Thai-NLP, (a) Prior Thai-NLP works, (b)
The extension from those prior Thai-NLP works
2. RESEARCH METHOD

Extending the prior works in Figure 1(a), this paper proposes a state-of-the-art COVID-19 fake
news detection on Thai texts based on transfer learning, as shown in Figure 1(b), which can be
seen as an IDS for COVID-19 news quality. This section covers Thai text preparation, English
to Thai translating, feature shifting and Thai-NLP models based on transfer learning,
respectively.


2.1. Thai text preparation 

Informal Thai words and phrases on social media (tweets, posts or comments) can be written in
many variant styles, so raw Thai texts can easily introduce noise and outliers into model
learning. Thai text preparation is therefore still one of the most important Thai-NLP tasks
for detection correctness.


2.1.1. Thai text cleansing 
To correct grammatical errors and delete noise and outliers, a raw Thai text needs to be
cleaned: spelling is corrected, redundant (repeated) characters are removed, and other
irrelevant characters (e.g., #, !, *) are deleted. Thai text cleansing reduces the irrelevant
and redundant data in the raw social text prior to the fine-tuning stage. In addition, some
posts or comments are such long Thai texts that they form a multi-semantic text. A
multi-semantic text is manually divided into several single sentences, each with the “fake”
label, which are then used to train the model. Moreover, some words can be combined with
feature shifting to enlarge the size of the target dataset.
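
The cleansing rules described above can be sketched with two regular expressions. This is a
minimal illustration rather than the pipeline used in this paper: the function name
`clean_text`, the character set `#!*` and the rule that collapses any character repeated three
or more times into a single one are assumptions, and spelling correction is left out.

```python
import re

# Assumed set of irrelevant characters, taken from the examples in the text.
IRRELEVANT = re.compile(r"[#!*]+")
# Assumed heuristic: a character repeated 3+ times in a row is redundant.
REPEATS = re.compile(r"(.)\1{2,}")


def clean_text(text: str) -> str:
    """Strip irrelevant characters and collapse redundant repeats."""
    text = IRRELEVANT.sub(" ", text)   # replace '#', '!' and '*' with a space
    text = REPEATS.sub(r"\1", text)    # 'กกกกก' becomes 'ก'
    return " ".join(text.split())      # normalise the remaining whitespace


print(clean_text("มากกกกก!!! #covid"))
```

The repeat threshold is a trade-off: a lower value would also damage legitimate doubled
characters, so it should be tuned on real posts.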


2.1.2. Thai word segmentation 
Unlike English, Thai is naturally an unsegmented language that has no space between 2 words.
Thai word segmentation (or tokenization) is still acknowledged as one of the main Thai-NLP
issues, and it is still waiting for an appropriate solution. Direct dictionary-based word
segmentation [34] might not be a good solution for COVID-19 fake news detection. As shown in
Figure 2, all 4 candidate segmentations of the same input are grammatically correct in Thai,
so some Ada-SGD algorithms [34, 36] are needed to optimize the segmentation. This paper calls
the PyThaiNLP [77] API, which bundles many Thai tokenization engines, e.g., DeepCut and
Lextro, to segment all words in each sentence or phrase of the primary collection.
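
To make the ambiguity concrete, the sketch below enumerates every dictionary-consistent
segmentation of an unsegmented string. It is only an illustration of why a plain dictionary
lookup is not enough: the function `all_segmentations`, the four-word toy vocabulary and the
example string are assumptions, not the sentence of Figure 2, and the code does not use
PyThaiNLP.

```python
from functools import lru_cache


def all_segmentations(text, vocab):
    """Return every way to split `text` into words found in `vocab`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # one valid way to segment the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in vocab:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


# Toy vocabulary (assumed): the string below can be read as two different word pairs.
toy_vocab = frozenset({"ตา", "กลม", "ตาก", "ลม"})
candidates = all_segmentations("ตากลม", toy_vocab)
for words in candidates:
    print(" | ".join(words))
```

Both candidates are valid dictionary splits, so choosing between them needs context or a
learned model, which is the role of the tokenization engines mentioned above.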


[Figure 2: one unsegmented Thai input sentence passed through Thai word segmentation, with four
candidate outputs (1) to (4).]

Figure 2. Thai word segmentation (or tokenization) as one of the main problems in Thai-NLP


2.2. English to Thai translating 

Many open COVID-19 fake news datasets are available on Kaggle, but they are in English. A large
amount of global information in Thai is translated from worldwide news publications written in
English, so global COVID-19 fake news also has a chance to be translated and published on Thai
social media. The open COVID-19 fake news datasets are therefore translated to Thai as the
source dataset, using the SCB-MT-EN-TH translation by VISTEC.AI [78-80] as external knowledge.
The source dataset is used to pre-train the transfer learning models (BERT [72], ULMFiT [73]
and GPT [74]).
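
The construction step can be pictured as a loop that translates each labeled English record
and carries its label over unchanged. The sketch below assumes a generic `translate` callable;
`build_source_dataset`, `toy_translate` and the sample records are hypothetical stand-ins and
do not reproduce the SCB-MT-EN-TH interface or any row of the open datasets.

```python
def build_source_dataset(records, translate):
    """Translate (english_text, label) pairs to Thai and keep each label."""
    source = []
    for english_text, label in records:
        thai_text = translate(english_text).strip()
        if thai_text:  # drop rows for which the translator returned nothing
            source.append({"text": thai_text, "label": label})
    return source


def toy_translate(text):
    """Hypothetical stand-in for a real English-to-Thai translation model."""
    lookup = {"Garlic cures COVID-19.": "กระเทียมรักษาโควิด-19 ได้"}
    return lookup.get(text, "")


# Made-up records used only to exercise the function.
records = [("Garlic cures COVID-19.", "fake"), ("Untranslatable row", "real")]
source_dataset = build_source_dataset(records, toy_translate)
print(source_dataset)
```

Passing the translator in as a parameter keeps the dataset logic independent of whichever
translation system is compared or swapped in later.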


2.3. Feature shifting 


By the nature of the language, Thai usage varies widely. The variance can be categorized into
synonyms and written styles, as shown in Figure 3, and exploiting it is called feature
shifting. This paper proposes feature shifting for Thai words and contents concerning COVID-19
in order to enlarge the target data volume based on these usage variances and to increase the
model correctness.

There are many Thai synonyms in use on social media. In this paper, Thai words used about
COVID-19 and their synonyms from December 2019 to June 2020 are manually collected into the
datasets, and 1,362 Thai synonym groups are stored in a database. Since fake news can be
composed arbitrarily using any of the Thai synonyms, it is essential to train on more
generated texts with the same semantic content, produced by replacing those synonyms, so that
the model sees more data variety.
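Each word obtained from Thai word segmentation is searched for in the synonym database,
$\mathrm{search}_{\text{Thai syn}}(\mathrm{word}_i)$, and all of its synonyms are listed
($\mathrm{list}_{\text{Thai syn}}$) in order to generate more Thai written styles, as modeled
by the recursive function (1).

$$
\mathrm{search}_{\text{Thai syn}}(\mathrm{word}_i) =
\begin{cases}
\mathrm{list}_{\text{Thai syn}} = \{(\mathrm{word}_i, \mathrm{syn}_1), (\mathrm{word}_i, \mathrm{syn}_2), \ldots, (\mathrm{word}_i, \mathrm{syn}_n)\}; & \text{if } \mathrm{list}_{\text{Thai syn}} \in P \\
\mathrm{search}_{\text{Thai syn}}(\mathrm{word}_{i+1}); & \text{otherwise}
\end{cases}
\tag{1}
$$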


The same semantic content can further be composed in many Thai written styles, which is a word
combination problem that depends on writing behaviors. These variant styles can also be
combined with Thai synonyms. Thai COVID-19 fake news detection should handle the written
variants at both the sentence and phrase levels. To enlarge the target dataset, a number of
possible writing styles are generated and used to train the model.
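
The synonym part of feature shifting can be sketched as a Cartesian product over the synonym
choices of each token. This is a simplified illustration under stated assumptions: the
function `shift_features` and the single synonym group are hypothetical, the group is not
taken from the 1,362-group database, and reordering words into different written styles is
not covered.

```python
from itertools import product


def shift_features(tokens, synonym_groups):
    """Return every text obtained by swapping tokens for their synonyms."""
    # Map each word to the whole synonym group it belongs to.
    lookup = {word: tuple(group) for group in synonym_groups for word in group}
    # A token without a known synonym can only stay as itself.
    choices = [lookup.get(token, (token,)) for token in tokens]
    # Thai is written without spaces, so the tokens are simply concatenated.
    return ["".join(variant) for variant in product(*choices)]


# Illustrative synonym group (assumed), not an entry of the paper's database.
toy_groups = [("โควิด", "โควิด-19", "โคโรนา")]
variants = shift_features(["โควิด", "ระบาด"], toy_groups)
print(variants)
```

The number of generated texts is the product of the group sizes, so long sentences with many
synonym-rich words may need sampling instead of full enumeration.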


[Figure 3: one Thai input sentence passed through feature shifting, with generated outputs (1),
(2), (3), etc.]

Figure 3. Feature shifting: Thai synonyms with Thai written styles


2.4. Thai-NLP models based on transfer learning 

In this paper, transfer learning refers to two important steps of word2vec deep neural
training: pre-training the Thai COVID-19 model and fine-tuning the Thai COVID-19 model, as
shown in Figure 4. Word2vec makes the neural network learn the word relations within a Thai
text. In the first training step, the network parameters are randomly initialized and learnt
from the translated Thai texts. In fine-tuning, the model’s parameters are reused and learn
further from other Thai texts crawled from social media (with feature shifting to increase
data volume and variety). Transfer learning for sequential data (e.g., word-level sequences)
and language processing is mostly applied in transformer-based models [71, 81] and other
recent sequential models: BERT [72], ULMFiT [73], and GPT [74].


[Figure 4: diagram of transfer learning starting from the source dataset.]

Figure 4. Transfer learning steps of deep neural training
2.4.1. BERT

Based on the Transformer, BERT [72] has the well-known “self-attention” mechanism that enables
neural networks to highlight the essential weights for each token within a sequence. BERT is a
bidirectional representation that uses both left and right context in all network layers. It
is pre-trained as a sequence model with the objective known as “masked language modeling”. The
output for the first word is still used as a parameter in the next steps of language modeling.
For the configuration, BERT is set to a hidden size of 768 in each layer and 12 attention
heads, for a total of about 110M parameters.


2.4.2. ULMFiT 

ULMFiT [73] is proposed as transfer learning for language processing, playing the same role as
ImageNet pre-trained models in computer vision. ULMFiT builds on AWD-LSTM [82] and has 3
stages: pre-training, discriminative fine-tuning and classifier fine-tuning. ULMFiT shows
state-of-the-art text classification performance with low processing resources. Moreover, it
includes several novel techniques to address the catastrophic interference problem. The model
has 3 AWD-LSTM layers with an embedding size of 400 and a hidden size of 1,150 in each hidden
layer.


2.4.3. GPT 

GPT [74] demonstrates state-of-the-art results in many NLP applications. Since the Transformer
can handle long-range dependencies in sequential data better than an RNN, it can take in the
entire sequence at once. GPT is a transformer-based model like BERT, but it is a
unidirectional feed-forward architecture. For this work, the model is configured with 12
transformer layers, 12 attention heads and 768-dimensional states.


3. MEASUREMENT AND RESULTS 

Based on Thai-NLP measurement, this section demonstrates the promising results of the proposed
Thai COVID-19 fake news detection according to the paper contributions (as one of the
“Artificial Intelligence Open Issues against COVID-19” by Montreal.AI [11]). In summary, the
pre-trained models were fine-tuned into Thai COVID-19 fake news classifiers using the
pre-training and fine-tuning Thai COVID-19 datasets. These transfer learning models (BERT
[72], ULMFiT [73] and GPT [74]) on Thai texts were run on a Tesla V100 GPU on GCP with the
Colab environment. This section is divided into 3 subsections: dataset construction,
pre-training Thai COVID-19 models and fine-tuning Thai COVID-19 models.


3.1. Dataset construction 

The source dataset came from open COVID-19 news datasets and the target dataset from Thai texts
crawled from social media; in both, each text was labeled fake or real. This section describes
how the 2 types of datasets used in transfer learning were constructed.


3.1.1. Source dataset by machine translation

The pre-training Thai COVID-19 models were trained on the 123,762 Thai-translated single texts
of the source dataset (as described in section 3.2). For data collection, the global COVID-19
fake (and real) news in English from well-known open datasets, CoAID [25, 26], ReCOVery
[27, 28] and FakeCovid [29, 30], were collected and translated to Thai using the SCB-MT-EN-TH
translation by VISTEC.AI [77-79]. To evaluate the quality of the English to Thai translation,
BLEU [83] was used to compare it with other well-known English to Thai machine translation
systems: AI for Thai by NECTEC [47] and Google Translate. The results demonstrated that the
transformer-based SCB-MT-EN-TH translation outperformed the other neural machine translations
for English to Thai translation at every text length, based on the COVID-19 open datasets, as
shown in Figure 5.
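
For readers unfamiliar with the metric, the sketch below computes an unsmoothed,
single-reference BLEU score from clipped n-gram precisions and a brevity penalty. It is a
simplified illustration, not the evaluation script used for Figure 5: the 4-gram limit with
uniform weights, the absence of smoothing and the function names are assumptions, and both
inputs are expected to be lists of already segmented tokens.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference (token lists)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any empty precision gives 0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = ["a", "b", "c", "d", "e"]
print(sentence_bleu(reference, reference))
print(sentence_bleu(["a", "b", "c", "d"], reference))
```

Because Thai has no spaces, the score depends on the word segmentation applied to both the
machine output and the reference, so the same tokenizer should be used for every system being
compared.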


3.1.2. Target dataset with feature shifting improvement 

The language network models were fine-tuned starting from the pre-trained weights learnt from
the 123,762 Thai texts of the source dataset. To construct the target dataset, local COVID-19
fake (and real) news in Thai from posts and/or links shared on Facebook, Twitter and fake URLs
were collected from December 2019 to June 2020. All collected information was extracted into
45,372 single Thai texts for the fine-tuning Thai COVID-19 dataset by Thai text cleansing and
Thai word segmentation, respectively. To reflect the variant Thai usage on social media, the
target dataset was enlarged by feature shifting to 73,280 Thai text examples, with an OOV rate
of 3.26%. The target dataset was used to fine-tune the Thai COVID-19 models (section 3.3).

Feature shifting thus enlarged the number of Thai text examples in the target dataset used to
train the fine-tuning models. Figure 6 demonstrates the accuracy enhancement from feature
shifting in all transfer learning models for Thai COVID-19 fake news classification. GPT
showed the highest improvement from feature shifting, 36.43%.
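
An OOV rate such as the one quoted above can be computed as the share of target tokens that do
not occur in a reference vocabulary. The sketch below is one plausible reading, not the
paper's definition: it assumes the rate is counted per token (not per distinct word) against
the vocabulary of the source dataset, and the function name `oov_rate` is hypothetical.

```python
def oov_rate(target_texts, source_vocab):
    """Fraction of target tokens that are missing from `source_vocab`."""
    tokens = [token for text in target_texts for token in text]
    if not tokens:
        return 0.0  # an empty target set has no unseen tokens
    unseen = sum(1 for token in tokens if token not in source_vocab)
    return unseen / len(tokens)


# Toy segmented texts (assumed): one of four tokens is unseen.
rate = oov_rate([["a", "b"], ["c", "a"]], {"a", "b"})
print(f"{rate:.2%}")
```

Counting per distinct word instead would only require replacing the token list with a set.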








[Figure 5: line chart of BLEU against length of text translation, with bins (0,5), (5,10),
(10,15), (15,20) and (20,25), for SCB-MT-EN-TH by VISTEC.AI, Google Translate, and AI for Thai
by NECTEC.]

Figure 5. Comparison between English to Thai machine translations based on global COVID-19
open datasets

[Figure 6: bar chart comparing NoFS and FS for BERT max(seq)=256, ULMFiT and GPT max(seq)=256;
the legend "Improvement by Feature Shifting (FS)" lists 25.67%, 29.75% and 36.43%.]

Figure 6. Transfer learning improved by feature shifting in fine-tuning stage

3.2. Pre-training Thai COVID-19 models 

The source dataset was used to train the pre-training Thai COVID-19 models and was split into
3 partitions: training (70%), validation (15%), and testing (15%), following the configuration
in [80]. As shown in Table 1, ULMFiT had the shortest pre-training time, since it is based on
AWD-LSTM [82] and well designed for pre-trained models, especially for state-of-the-art Thai
text classification with low processing resources. ULMFiT also had the lowest training loss on
the 123,762 Thai text examples of the source dataset, which was constructed by SCB-MT-EN-TH
translation of the open COVID-19 datasets (CoAID [25, 26], ReCOVery [27, 28], and FakeCovid
[29, 30]).
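
The 70/15/15 partition can be reproduced with a seeded shuffle followed by slicing. The sketch
below is an assumption about the mechanics only: the function `split_dataset`, the fixed seed
and the use of integer indices in place of real texts are illustrative, and the paper does not
state whether the split was stratified by label.

```python
import random


def split_dataset(examples, train=0.70, valid=0.15, seed=0):
    """Shuffle once, then slice into training, validation and testing parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],  # the remainder is the testing part
    )


# Indices stand in for the 123,762 translated texts of the source dataset.
train_set, valid_set, test_set = split_dataset(range(123_762))
print(len(train_set), len(valid_set), len(test_set))
```

Giving the remainder to the testing part guarantees that every example lands in exactly one
partition even when the percentages do not divide the dataset evenly.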


Table 1. Comparison between pre-training Thai COVID-19 models
Transfer learning model    Pre-training time (hours)    Loss
BERT max(seq)=256          22                           4.0719
ULMFiT                     12                           3.8517
GPT max(seq)=256           19                           4.5365


3.3. Fine-tuning Thai COVID-19 models 

As shown in Table 2, the fine-tuning Thai COVID-19 models took their parameters from the
pre-trained models and learnt from a further 73,280 Thai text examples of the target dataset.
The split between training and testing was set to 70:30 [80]. Although GPT had the highest
improvement from feature shifting, ULMFiT performed well on the different writing styles and
synonyms in Thai texts and gave the highest accuracy on this target data. Although BERT is a
bidirectional model that can leverage both left and right words (called tokens) for embeddings
[72, 75], ULMFiT in contrast achieved higher performance, especially with Thai feature
shifting.


Table 2. Comparison between fine-tuning Thai COVID-19 models
Transfer learning model    Fine-tuning time (minutes per epoch)    Accuracy
BERT max(seq)=256          4                                       0.6286
ULMFiT                     2                                       0.7293
GPT max(seq)=256           4                                       0.6819
4. CONCLUSION

To address the problem of fake news against Thai COVID-19 recovery, this paper presents fake
news detection in Thai texts based on transfer learning, using Thai-NLP techniques. The
transfer learning models are BERT, ULMFiT and GPT, of which ULMFiT shows the highest
performance for Thai COVID-19 fake news detection. With the help of English to Thai machine
translation, the open COVID-19 datasets in English can be leveraged to construct the source
dataset for pre-training. Feature shifting is proposed to enlarge the target dataset for
fine-tuning; it improves the classification accuracy of all transfer learning models, and GPT
has the highest improvement rate. As a limitation, noise and outliers in Thai texts are still
a main obstacle to fully supervised learning. Labeling millions of Thai texts without noise is
practically impossible, and mislabeling easily arises under full supervision at large volume,
which makes the model learn some wrong labels. Future Thai fake news detection should use
semi-supervised learning in 2 parts: (i) first train a partially supervised model on only the
Thai texts with high-quality labels, and (ii) automatically label the unknown Thai texts with
that partially supervised model. Besides preventing wrong labels, this also addresses the long
training time of full supervision. To extend this work, multi-task transfer learning with Thai
synonyms can be used for Thai language understanding, so that the model learns many tasks from
multiple datasets.